model rollout
- Asia > Middle East > Jordan (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- North America > Canada (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States (0.04)
- Asia > China (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States (0.04)
- Asia > China (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- North America > Canada (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Why long model-based rollouts are no reason for bad Q-value estimates
Wissmann, Philipp, Hein, Daniel, Udluft, Steffen, Tresp, Volker
This paper explores the use of model-based offline reinforcement learning with long model rollouts. While some literature criticizes this approach due to compounding errors, many practitioners have found success in real-world applications. The paper aims to demonstrate that long rollouts do not necessarily result in exponentially growing errors and can actually produce better Q-value estimates than model-free methods. These findings can potentially enhance reinforcement learning techniques.
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.05)
- Europe > Germany > North Rhine-Westphalia > Upper Bavaria > Munich (0.04)
Optimistic Model Rollouts for Pessimistic Offline Policy Optimization
Zhai, Yuanzhao, Li, Yiying, Gao, Zijian, Gong, Xudong, Xu, Kele, Feng, Dawei, Bo, Ding, Wang, Huaimin
Model-based offline reinforcement learning (RL) has made remarkable progress, offering a promising avenue for improving generalization with synthetic model rollouts. Existing works primarily focus on incorporating pessimism for policy optimization, usually via constructing a Pessimistic Markov Decision Process (P-MDP). However, the P-MDP discourages the policies from learning in out-of-distribution (OOD) regions beyond the support of offline datasets, which can under-utilize the generalization ability of dynamics models. In contrast, we propose constructing an Optimistic MDP (O-MDP). We initially observed the potential benefits of optimism brought by encouraging more OOD rollouts. Motivated by this observation, we present ORPO, a simple yet effective model-based offline RL framework. ORPO generates Optimistic model Rollouts for Pessimistic offline policy Optimization. Specifically, we train an optimistic rollout policy in the O-MDP to sample more OOD model rollouts. Then we relabel the sampled state-action pairs with penalized rewards and optimize the output policy in the P-MDP. Theoretically, we demonstrate that the performance of policies trained with ORPO can be lower-bounded in linear MDPs. Experimental results show that our framework significantly outperforms P-MDP baselines by a margin of 30%, achieving state-of-the-art performance on the widely-used benchmark. Moreover, ORPO exhibits notable advantages in problems that require generalization.
- Asia > Middle East > Jordan (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- (2 more...)
COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based RL
Wang, Xiyao, Zheng, Ruijie, Sun, Yanchao, Jia, Ruonan, Wongkamjan, Wichayaporn, Xu, Huazhe, Huang, Furong
Dyna-style model-based reinforcement learning contains two phases: model rollouts to generate sample for policy learning and real environment exploration using current policy for dynamics model learning. However, due to the complex real-world environment, it is inevitable to learn an imperfect dynamics model with model prediction error, which can further mislead policy learning and result in sub-optimal solutions. In this paper, we propose $\texttt{COPlanner}$, a planning-driven framework for model-based methods to address the inaccurately learned dynamics model problem with conservative model rollouts and optimistic environment exploration. $\texttt{COPlanner}$ leverages an uncertainty-aware policy-guided model predictive control (UP-MPC) component to plan for multi-step uncertainty estimation. This estimated uncertainty then serves as a penalty during model rollouts and as a bonus during real environment exploration respectively, to choose actions. Consequently, $\texttt{COPlanner}$ can avoid model uncertain regions through conservative model rollouts, thereby alleviating the influence of model error. Simultaneously, it explores high-reward model uncertain regions to reduce model error actively through optimistic real environment exploration. $\texttt{COPlanner}$ is a plug-and-play framework that can be applied to any dyna-style model-based methods. Experimental results on a series of proprioceptive and visual continuous control tasks demonstrate that both sample efficiency and asymptotic performance of strong model-based methods are significantly improved combined with $\texttt{COPlanner}$.
- North America > United States > Maryland (0.14)
- Asia (0.14)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.89)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Reward-Consistent Dynamics Models are Strongly Generalizable for Offline Reinforcement Learning
Luo, Fan-Ming, Xu, Tian, Cao, Xingchen, Yu, Yang
Learning a precise dynamics model can be crucial for offline reinforcement learning, which, unfortunately, has been found to be quite challenging. Dynamics models that are learned by fitting historical transitions often struggle to generalize to unseen transitions. In this study, we identify a hidden but pivotal factor termed dynamics reward that remains consistent across transitions, offering a pathway to better generalization. Therefore, we propose the idea of reward-consistent dynamics models: any trajectory generated by the dynamics model should maximize the dynamics reward derived from the data. We implement this idea as the MOREC (Model-based Offline reinforcement learning with Reward Consistency) method, which can be seamlessly integrated into previous offline model-based reinforcement learning (MBRL) methods. MOREC learns a generalizable dynamics reward function from offline data, which is subsequently employed as a transition filter in any offline MBRL method: when generating transitions, the dynamics model generates a batch of transitions and selects the one with the highest dynamics reward value. On a synthetic task, we visualize that MOREC has a strong generalization ability and can surprisingly recover some distant unseen transitions. On 21 offline tasks in D4RL and NeoRL benchmarks, MOREC improves the previous state-of-the-art performance by a significant margin, i.e., 4.6% on D4RL tasks and 25.9% on NeoRL tasks. Notably, MOREC is the first method that can achieve above 95% online RL performance in 6 out of 12 D4RL tasks and 3 out of 9 NeoRL tasks.
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- (18 more...)